Primary Goal:
Perform QC analysis on a single cell data set using the R Seurat package.
Perform QC analysis on a single cell data set using the R Seurat package.
Use the Athena terminal to set up your lab directory and obtain needed files.
# create and navigate into a new directory for this lab
mkdir scRNASeq
cd scRNASeq
# Copy feature count directory
cp -r /lustre/home/cctr691/scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly .
# Copy supporting files
cp /lustre/home/cctr691/scRNASeq/MitoCodingGenes13_human.txt .
cp /lustre/home/cctr691/scRNASeq/UCD52CR_107086_web_summary.html .
The majority of this lab will be completed in the Rmarkdown template EXCEPT this first section.
Download the “UCD52CR_107086_web_summary.html” file to your laptop and open it in a web browser, then answer the first set of questions in the RMarkdown template.
Q1) Was this sample aligned to a SINGLE or MULTIPLE genomes?
SINGLE
Q2) With the knowledge that we expect about 5000 cells from this sample, list at least 2 red flags identified by this report. Be sure to explore both tabs.
Red Flag 1: LOW READ COUNTS
Red Flag 2: Valid barcodes are below the threshold
Follow the instructions and code to remove dead cells, generate violin plots, cluster, and create tSNE plots of the sample data.
The sample data you are using is publicly available on GEO (Gene Expression Omnibus) in the Series GSE174391, only it is merged with other samples. For this lab I am providing you with the single WHIM2 PDX sample that has already had the mouse cells removed (so you only have human data).
Also, to get help using any command, type “?” followed by the command to see the documentation. E.g. “?plot”
Follow the steps below and answer the questions. This lab is based off of, but not exactly the same as, the Seurat introductory tutorial. Feel free to [explore that tutorial for more in-depth information](libra
Edit the setwd() command to point to YOUR lab directory! ### Load libraries
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Load the WHIM2 dataset into Seurat:
whim.data <- Read10X(data.dir = "./scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly")
#Initialize the Seruat Object:
whim <- CreateSeuratObject(counts = whim.data, project = "whim2", min.cells = 3, min.features = 200)
#View a summary of the object:
whim
## An object of class Seurat
## 19612 features across 2575 samples within 1 assay
## Active assay: RNA (19612 features, 0 variable features)
## 1 layer present: counts
In the environment tab of the RStudio interface, click on the “whim” object to expand and explore it. Look for the “meta.data” section. This is the most important section as it is where your new annotations are stored, and it tells you the name of annotations added by clustering and other processing.
Q3) What metadata elements are listed in this section?
orig.ident, nCount_RNA, nFeature_RNA, percent.mt
Q4) I told you this is 1 PDX sample, but the summary of the Seurat object says differently. How many samples does this object have? What is this actually telling you? What does the 19,612 number represent?
The object consists of 19,612 individual cell samples. However, all of these cells come from one PDX sample (as indicated by the
orig.identmetadata, labeled “whim2”). The number 19,612 represents the total count of cells in the dataset, rather than the number of biological samples. Although the data evolves from a single PDX sample, the single-cell sequencing captures thousands of individual cells, with each cell treated independently during the analysis.
# Load in the mitochondrial gene IDs:
mitogene_ids <- read.delim("./scRNASeq/MitoCodingGenes13_human.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Add the percentage of mito expression as an annotation to your Seurat object:
whim[["percent.mt"]] <- PercentageFeatureSet(whim, features = mitogene_ids)
# Create a Violin plot of the mito expression, nFeature and nCount data
# These last 2 were already in the dataset by default, so we didn’t need to add them.
VlnPlot(whim, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Q5) What is a reasonable filtering criteria for the percent of mitochondria expression?
Filtering out 10-15% could exclude poor quality cells based on the outliers in the above plots.
Sub set your cells to remove those that do not pass your mitochondrial cutoff. Make sure to replace the “ADD_YOUR_CUTOFF_HERE” with the appropriate expression (i.e. “> x” or “< x” where “x” is the threshold you specified in Q5).
# subset the seurat object for only cells passing your cutoff and print out the new violin plot
# Filter out cells with more than 15% mitochondrial gene expression
whim_filtered <- subset(whim, subset = percent.mt < 15)
# print new violin plot
VlnPlot(whim_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Note that we now have two (2) Seurat data sets, “whim” contains the full unfiltered dataset, and “whim_filtered” contains the filtered dataset.
Q6) How many cells survived the filtering? Print out the new “whim_filtered” object to figure this out.
ANSWER HERE
# Load the Seurat library
library(Seurat)
# Now try reading your data
whim.data <- Read10X(data.dir = "./scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly")
# Initialize the Seurat Object
whim <- CreateSeuratObject(counts = whim.data, project = "whim2", min.cells = 3, min.features = 200)
# Load the WHIM2 dataset
whim.data <- Read10X(data.dir = "./scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly")
# Initialize the Seurat object
whim <- CreateSeuratObject(counts = whim.data, project = "whim2", min.cells = 3, min.features = 200)
# subset the seurat object for only cells passing your cutoff and print out the new violin plot
# Load in the mitochondrial gene IDs:
mitogene_ids <- read.delim("./scRNASeq/MitoCodingGenes13_human.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Add the percentage of mito expression as an annotation to your Seurat object:
whim[["percent.mt"]] <- PercentageFeatureSet(whim, features = mitogene_ids)
# Create a Violin plot of the mito expression, nFeature and nCount data
# These last 2 were already in the dataset by default, so we didn’t need to add them.
VlnPlot(whim, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# Filter out cells with more than 15% mitochondrial gene expression
whim_filtered <- subset(whim, subset = percent.mt < 15)
# print new violin plot
VlnPlot(whim_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# View the filtered Seurat object
whim_filtered
## An object of class Seurat
## 19612 features across 470 samples within 1 assay
## Active assay: RNA (19612 features, 0 variable features)
## 1 layer present: counts
Q7) Compare the 2 plots, before and after. What can you conclude about the cells that did not make the mitochondrial cutoff in terms of nFeature and nCount? If you want to print both plots in the same R code chunk to compare more easily you can do that.
ANSWER HERE
# Before filtering
VlnPlot(whim, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0.4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# After filtering
VlnPlot(whim_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
You can continue to filter on nFeature and nCount, and I generally do, but for the purposes of this lab we will leave the dataset as-is and move on to normalization and clustering.
Before we can cluster the data and create those cool tSNE and UMAP visualizations, we need to normalize and scale the data, and run a PCA analysis. However, this can take a long time if we use all genes. Thus, we will reduce the set used to the most variable genes (don’t worry, the rest are still there:).
# Using the “whim_filtered” Seurat data set, normalize the data.
whim_filtered <- NormalizeData(whim_filtered)
## Normalizing layer: counts
# Before creating the visualizations, we need to reduce the dataset down to only
# those genes with the most variance. These will be our top 2000 most variable features.
whim_filtered <- FindVariableFeatures(whim_filtered, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
top10 <- head(VariableFeatures(whim_filtered), 10)
# Plot the top10 most variable genes as a scatter plot and mark the top 10
plot1 <- VariableFeaturePlot(whim_filtered)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Q8) Which gene is the most variable?
SCGB2A2.
What does scaling do? * Shifts the expression of each gene, so that the mean expression across cells is 0 * Scales the expression of each gene, so that the variance across cells is 1 * This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
Principle Component Analysis is a dimension reduction technique that transforms large data sets with a lot of dimensions into a data set with fewer dimensions, but still retains the majority of the information present, such as variability, in the initial data set. In single cell RNASeq it is used as a first step to identify the most informative principle components that should be used for data visualization by UMAP and tSNE.
# scale data
whim_filtered <- ScaleData(whim_filtered)
## Centering and scaling data matrix
# run PCA
whim_filtered <- RunPCA(whim_filtered, features = VariableFeatures(object = whim_filtered))
## PC_ 1
## Positive: NEAT1, KCNQ1OT1, OLMALINC, SVIL, SOX4, ELF3, MALAT1, FLNB, P4HA2, IER3
## NR4A1, KLF6, BNIP3, ZFP36L1, AHNAK, VEGFA, ERRFI1, KLHL24, LINC00342, LPIN1
## HERPUD1, MIR205HG, ZFP36L2, NUCB2, BNIP3L, SAT1, MCL1, MFGE8, TMEM106C, FBXO32
## Negative: SCGB3A1, MTRNR2L6, MTRNR2L10, ZG16B, H2AFZ, CCDC102B, H19, ARHGAP11A, S100A9, CDCA8
## AP001347.1, HIST2H2AC, AC109460.2, PRDM12, AEBP1, TRAF3IP3, AC011462.5, CREB3L1, AC008496.3, MYH11
## FMO1, WDR5, AL391244.2, CRYBA2, KCTD14, WDR76, FABP3, KCNE5, KIF18A, SPC25
## PC_ 2
## Positive: CENPF, TUBA1B, KIFC1, ASPM, UBE2T, STMN1, UBE2C, TPX2, NUSAP1, UBE2S
## HMGN2, KIF23, ESCO2, DLGAP5, H2AFZ, KPNA2, CDC20, ORC6, CCNB1, CDKN3
## BIRC5, CENPA, MKI67, MAD2L1, PTTG1, NUF2, TOP2A, HMGB1, KIF4A, NASP
## Negative: RRAGD, ERO1A, EGLN3, NDRG1, PFKP, CEBPD, CAMK1D, SAT1, IGFBP5, IER3
## EMP1, ANXA3, LOX, SLC2A1, MIR210HG, ORAI3, VEGFA, SCGB2A2, STC1, ANKRD37
## DUSP4, OPTN, EGFR, FZD8, USP53, BHLHE40, SLC16A3, TMC4, DNAH11, COL6A2
## PC_ 3
## Positive: ZG16B, KRTAP5-10, CALML5, RGS2, ZFP36L2, RCSD1, WNK4, PLA2R1, SORBS2, NPY1R
## ROPN1, KLF6, SCGB3A1, CCL28, KCNQ1OT1, CLDN8, EFNA1, MT-ND6, AARD, ZFP36
## LINC02532, SELENBP1, AC007906.2, NEAT1, MUC1, AZGP1, TMEM139, NCAN, SLC12A2, TPT1-AS1
## Negative: ERO1A, PFKP, CPA4, EGLN3, LOX, INHBB, SPAG4, IGFBP5, COL23A1, CAMK1D
## AL590004.3, SYDE1, EEF1A2, CCN4, SLC16A3, FOXQ1, MAP1B, AL133453.1, HACD1, ENO2
## STC1, LINC02570, MT2A, TOP2A, UBE2C, PTGS1, HMGB2, RRAGD, UPP1, CCDC151
## PC_ 4
## Positive: AC069185.1, LINC02256, AL135791.1, AL451085.2, Z99127.4, AP001453.2, AC004839.2, AC084262.2, LGR6, SDR42E2
## TENT5A, APOLD1, HIST1H2AE, HIST1H2BH, PCDH10, HIST1H2BN, HIST1H3G, PKP2, AC144450.1, STMN3
## AL138824.1, HIST1H2AK, POU3F1, HIST1H2BL, S1PR3, HIST1H2AH, IFI30, PLCB1, TEF, HIST1H2AG
## Negative: EMP1, KRT16, ANXA1, TMPRSS3, KLK7, KRT14, MSLN, INHBA, TMEM51, KRT17
## LGALS3, AC002511.2, RASD1, GDPD3, S100P, LEMD1, KLK10, RAB11FIP1, KLK6, ADD3
## PXDC1, DEFB1, PLSCR1, SOCS1, PLAUR, UBE2C, ANXA3, MAP3K8, SPON2, PPL
## PC_ 5
## Positive: AL365436.2, SGK1, KIFC1, Z99127.4, RND3, SRGAP2, HIST1H3G, HIST1H2AH, APOLD1, HIST1H2AG
## CLDN8, GTSE1, HIST1H2BN, NPAS2, ADD3, CBX4, AC069185.1, AL121944.1, EMP1, PLXNA2
## IRF1, HIST1H2BL, CDKN2B, FOSB, ODF3B, KRT16, AC004839.2, KLF5, PCDH10, PLCG2
## Negative: KRT86, CHI3L1, AKR1C2, CD5L, MFGE8, ITGBL1, KRT81, MAP9, ORM2, MSMO1
## RASSF2, IL34, NRN1, CRABP1, LOX, OAZ3, NUCB2, GPX8, S100A9, FAM162A
## NRG1, SPARC, S100A16, NPY1R, LDHB, ACAT2, HSD17B7, SCD, S100A8, THBS3
# create heatmap
DimHeatmap(whim_filtered, dims = 1:15, cells = 500, balanced = TRUE)
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
## Warning: Requested number is larger than the number of available items (470).
## Setting to 470.
Q9) What are the axes of each heatmap?
X axis- cells and Y-axis- Genes corresponding to PC
Q10) How many cells are used in each heatmap?
approx 500 i think
Q11) Describe what happens when the principal component (PC) number gets higher.
There will be change in patterns within the heatmatp (they become distinct to each other)
To visualize single cell data we need to use enough PCs to represent
the variability in the data without adding too much noise or having too
little. An Elbow plot can help with this along with looking at the
DimHeatmap plots. We need to pick a threshold close to the elbow in the
plot. This indicates where not much more information is being
added.
For the plot generated by the command below it looks to be around 5;
however, the default is usually 15, which is a good cutoff for most
datasets.
# draw elbow plot
ElbowPlot(whim_filtered)
#Create a tSNE plot using the first 5 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
DimPlot(whim_filtered, reduction = "tsne")
# add additional code here based on Question 12
Q12) In the code block above, copy the 2 lines of code that generated the tSNE plot to answer the following questions. Note I am expecting at least 4 plots.
What happens when you use fewer dimensions?
DESCRIPTIVE ANSWER HERE
# draw elbow plot
ElbowPlot(whim_filtered)
#Create a tSNE plot using the first 5 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:1)
DimPlot(whim_filtered, reduction = "tsne")
#Create a tSNE plot using lesser dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:2)
DimPlot(whim_filtered, reduction = "tsne")
whim_filtered <- RunTSNE(whim_filtered, dims = 1:3)
DimPlot(whim_filtered, reduction = "tsne")
whim_filtered <- RunTSNE(whim_filtered, dims = 1:4)
DimPlot(whim_filtered, reduction = "tsne")
How about only the first dimension?
DESCRIPTIVE ANSWER HERE Only genes that are dominant will be choosen for the plot. Genes with suttle differences will not be shown.
What happens when you use more dimensions?
More clusters of both predominant and lesser important cells will be shown with improved seperation of various populations which makes it very clear to distinguish and identify the intesnity of gene expression.
This flat pink plot is pretty boring. Let’s add some color by coloring each cell with the expression level of the 2 top most variable genes you identified in Q8.
# change dimensions back to 1:5
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
# create the feature plot
FeaturePlot(whim_filtered, features = c("S100A4", "IGFBP5"), reduction = "tsne")
You should get 2 plots with the most highly expressed gene grouped into a single location. This could be a cluster of cells with similar gene expression profiles. To find out, let’s cluster the dataset, and then color the tSNE plot by cluster.
# find nearest neighbors and cluster cells
whim_filtered <- FindNeighbors(whim_filtered, dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
whim_filtered <- FindClusters(whim_filtered, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 470
## Number of edges: 13675
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7971
## Number of communities: 5
## Elapsed time: 0 seconds
# overlay cluster colors onto dim plot
DimPlot(whim_filtered, reduction = "tsne", group.by = "seurat_clusters")
Q13) Which cluster includes the 2 genes you plotted in Q11?
I think its either cluster 2/3 as they have overlapping blue colors in between based on the tSNE plot
Now we can explore each gene one at a time, which is not very
efficient.
Next we will run an all-by-all differential expression analysis to
identify the top 10 gene markers for each cluster.
library(Seurat)
library(dplyr)
# find marker genes for each cluster
whim_filtered.markers <- FindAllMarkers(whim_filtered, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
# identify the top 20 marker genes for each cluster
whim_filtered.markers %>% group_by(cluster) %>% top_n(n = 20, wt = avg_log2FC) -> top20
# plot marker gene expression for each cluster
DoHeatmap(whim_filtered, features = top20$gene)
## Warning in DoHeatmap(whim_filtered, features = top20$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: REEP1, POC5, CLMP, CNTRL, WDR62, FBXO43, AC091057.6, ABCG1, B4GALT2,
## ELOF1, FAM172A, ZFAND2B, HERPUD2, CHCHD4, PARVB, ACAD9, RCC1, PARN, RAB15,
## TRMU, CHAF1A, WWC1, PPP1R13B, UTP3, FBRS, SLC43A3, TRIM4, CCDC51, UBL3,
## ZNF385A, WDR35, ZMIZ1, FKBPL, IQANK1, RAPGEF1, SLC35B2, MYO1E, PDCL, ZNF239,
## NOP14, ANKRD36, GPKOW, RMND5B, TNKS2, ATP13A2, NOP9, LINC01554, SNHG21,
## PLEKHF1, CAMKK1, NAA40, EIF2AK3, MAGEH1, C1orf216, LINC01534, ELL2, ST3GAL6,
## POLRMT, SETD4, VASN, ACER3
Q14) What is the x-axis for in this plot?
ANSWER HERE Different cluster of cells seperated by a coloumn
Q15) Which 2 clusters have the most distinct expression profile from the rest of the clusters?
ANSWER HERE cluster 0 and 3 seem to be very different from each other similar to 2 and 4.
Finally, let’s export your list of differentially expressed genes to
a text file.
First filter base on significant p-value, then save to a file. Column
explanations are as follows:
# filter on adjusted pval
whim_filtered.markers <- whim_filtered.markers[whim_filtered.markers$p_val_adj <= 0.001,]
# save to a CSV file
write.csv(whim_filtered.markers, file="whim_filtered_SignificantMarkerGenes.csv", quote = FALSE)
Either through Athena, or by downloading this file, open the CSV file and answer the following question.
Q16) What is the log2FC of the top most variable gene you identified in Q8?
ANSWER HERE 3.945935221
Bonus Q1) Pick one or more genes from the heatmap, any genes you want. Modify the violin plot command from the QC section by replacing the “nFeature_RNA”, “nCount_RNA”, and “percent.mt” with your gene names (delete or add entries as needed). Adjust the “n_col” attribute as needed, and add in the “group_by” statement from step 9 above to group the plots by cluster.
# Bonus Q1 code goes here
Bonus Q2) Plot the tSNE map, only color it by mitochondrial expression. Use the “cols” attribute to change the color scale (hint, look in the documentation for the command to see its usage).
# Bonus Q2 code goes here
Finally, save this file, then knit your HTML report for submission
via GitHub.
Go back to the Word document for github instructions!